Be Conservative: Enhancing Failure Diagnosis with Proactive Logging

Authors

  • Ding Yuan
  • Soyeon Park
  • Peng Huang
  • Yang Liu
  • Michael Mihn-Jong Lee
  • Xiaoming Tang
  • Yuanyuan Zhou
  • Stefan Savage
Abstract

When systems fail in the field, logged error or warning messages are frequently the only evidence available for assessing and diagnosing the underlying cause. Consequently, the efficacy of such logging (how often and how well error causes can be determined via postmortem log messages) is a matter of significant practical importance. However, there is little empirical data about how well existing logging practices work and how they can be improved. We describe a comprehensive study characterizing the efficacy of logging practices across five large and widely used software systems. Across 250 randomly sampled reported failures, we first find that more than half of the failures could not be diagnosed well using existing log data. Surprisingly, we find that the majority of these unreported failures are manifested via a common set of generic error patterns (e.g., system call return errors) that, if logged, can significantly ease the diagnosis of these unreported failure cases. We further mechanize this knowledge in a tool called Errlog, which proactively adds appropriate logging statements into source code while adding only 1.4% performance overhead. A controlled user study suggests that Errlog can reduce diagnosis time by 60.7%.
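To make the "generic error pattern" idea concrete, the sketch below shows one such pattern, a failed system call, logged at the point where the error is detected. It is only an illustration of the kind of statement the abstract describes, not Errlog's actual output or implementation; the logger name `errlog_sketch` and the helper `read_config` are hypothetical names introduced for this example.

```python
import errno
import logging
import os

# Hypothetical logger name, chosen for this illustration only.
logger = logging.getLogger("errlog_sketch")


def read_config(path):
    """Return up to 4096 bytes from `path`, or None if it cannot be opened.

    Hypothetical helper: the point is the log statement on the error path.
    """
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        # Generic error pattern: a system call failed. Record which call
        # failed, its argument, and the error code before handling it.
        logger.error(
            "open(%r) failed: errno=%s (%s)",
            path,
            errno.errorcode.get(exc.errno, exc.errno),
            exc.strerror,
        )
        return None
    try:
        return os.read(fd, 4096)
    finally:
        os.close(fd)
```

Without the `logger.error` call, the caller would see only a `None` result and the postmortem log would carry no trace of which call failed or why, which is the situation the study reports for more than half of the sampled failures.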


Similar resources

A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud

As high-performance computing (HPC) systems continue to increase in scale, their mean time to interrupt decreases correspondingly. The current state of practice for fault tolerance (FT) is checkpoint/restart. However, with increasing error rates, increasing aggregate memory, and I/O capabilities that do not increase proportionally, it is becoming less efficient. Proactive FT avoids experiencing failures ...


Acute renal failure due to renal lymphomatous infiltration as the initial manifestation

A male patient with acute renal failure (ARF) due to large B-cell non-Hodgkin lymphoma infiltration of the kidney is presented. The diagnosis was suspected because of the coincidence of ARF and tumor lysis syndrome that did not respond to conservative renal therapies. A renal biopsy confirmed the diagnosis, and appropriate chemotherapy led to complete recovery of renal function.


Health Care Failure Mode and Effect Analysis: A Useful Proactive Risk Analysis of Nutrition and Food Distribution in Mashhad Qaem Hospital’s Women’s Surgery Ward in 2013

Background and Objectives: Good medical nutrition therapy (MNT) is crucial to inpatients' health and treatment, and is part of routine hospital care. The surgery ward is a highly danger-prone section in any hospital. The present study was conducted for a proactive risk analysis of nutrition and food distribution in Mashhad Qaem Hospital’s Women’s Surgery Ward in 2013 through health care failure...


Use of FMEA analysis to reduce risk of errors in prescribing and administering drugs in paediatric wards: a quality improvement report

Objective: Administering medication to hospitalised infants and children is a complex process at high risk of error. Failure mode and effect analysis (FMEA) is a proactive tool used to analyse risks, identify failures before they happen and prioritise remedial measures. To examine the hazards associated with the process of drug delivery to children, we performed a proactive risk-assessment analy...


Using Message Semantics for Fast-Output Commit in Checkpointing-and-Rollback Recovery

Checkpointing is a very effective technique to ensure the continuity of long-running applications in the event of failures. However, one of the handicaps of coordinated checkpointing is the high latency for committing output from the application to the external world. Enhancing the checkpointing scheme with a message logging protocol is a good solution to reduce the output latency. The ide...




Publication date: 2012